{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "<center><font face=\"黑体\" size=4>《机器学习基础实践》课程实验指导书</font></center>\n",
    "<br>\n",
    "<center><font face=\"黑体\",size=4>第7章  聚类</font></center>\n",
    "\n",
    "$\\textbf{1.实验目标}$\n",
    "\n",
    "了解聚类算法的基本概念、训练方法、实现方法，并实现部分聚类算法。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\textbf{2.实验内容}$\n",
    "\n",
    "$\\textbf{7.1 聚类的基本概念}$\n",
    "\n",
    "在分类和回归任务中，训练样本都有标记信息，学习的目标在于从训练数据中学习得到预测模型，为未见样本预测正确的标记信息。与分类和回归任务不同，在聚类任务中，训练样本的标记信息是未知的，学习的目标是通过对无标记的训练样本的学习来揭示数据的内在规律，为进一步的数据分析提供基础。\n",
    "\n",
    "聚类试图将数据集中的样本划分为若干个互不相交的子集，每个子集称为一个“簇”，每个簇可能对应一些潜在的概念。假定样本集合$D=\\{\\textbf{x}_1,\\textbf{x}_2,…,\\textbf{x}_m\\}$包含$m$个无标记的样本，每个样本$\\textbf{x}_i=[x_{i1},x_{i2},…,x{id}]$是一个$d$维的特征向量，聚类算法将样本集$D$划分为$k$个互不相交的簇$\\{C_l\\}_{l=1}^{k}$，用$\\lambda_j$表示样本$\\textbf{x}_j$的簇标记，聚类的结果可以用包含$m$个元素的簇标记向量$\\boldsymbol{\\lambda}=[\\lambda_{1},\\lambda_{2}...,\\lambda_{m}]$表示每个元素表示对应样本的簇标记。\n",
    "\n",
    "聚类的基本思想是将样本空间中相似的样本划分到同一个簇。因此，如何度量样本空间中任意两个样本之间的相似性，称为聚类的关键。在聚类任务中，通常采用距离来度量任意两个样本的相似性。常见的数据样本之间的距离计算方法如下所示。\n",
    "\n",
    "$\\textbf{(1)欧氏距离：}$ 适用于连续属性，\n",
    "\n",
    "\\begin{equation}\n",
    "dist(\\textbf{x}_i,\\textbf{x}_j)=\\sqrt {\\sum_{k=1}^{d}(x_{ik}-x_{jk})^2}\n",
    "\\end{equation}\n",
    "\n",
    "$\\textbf{(2)曼哈顿距离：}$适用于连续属性，\n",
    "\n",
    "\\begin{equation}\n",
    "dist(\\textbf{x}_i,\\textbf{x}_j)=\\sum_{k=1}^{d}|(x_{ik}-x_{jk})|\n",
    "\\end{equation}\n",
    "\n",
    "$\\textbf{(3)马氏距离：}$适用于连续属性，\n",
    "\n",
    "\\begin{equation}\n",
    "dist(\\textbf{x}_i,\\textbf{x}_j)=\\sqrt{(\\textbf{x}_i-\\textbf{x}_j)\\Sigma^{-1}(\\textbf{x}_i-\\textbf{x}_j)}\n",
    "\\end{equation}\n",
    "\n",
    "$\\textbf{(4)余弦相似度：}$适用于连续属性，\n",
    "\n",
    "\\begin{equation}\n",
    "dist(\\textbf{x}_i,\\textbf{x}_j)=\\frac{\\textbf{x}_{i}^{T}\\textbf{x}_j}{\\sqrt{\\textbf{x}_{i}^{T}\\textbf{x}_i}\\sqrt{\\textbf{x}_{j}^{T}\\textbf{x}_j}}\n",
    "\\end{equation}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\textbf{7.2 k均值聚类算法}$\n",
    "\n",
    "给定样本集$D=\\{\\textbf{x}_1,\\textbf{x}_2,…,\\textbf{x}_m\\}$，$k$均值($k$-means)算法将样本划分为$k$个簇$C=\\{C_1,C_2,…, C_k\\}$使得如下所示的平方误差最小，\n",
    "\n",
    "\\begin{equation}\n",
    "E = \\sum_{i=1}^{k}\\sum_{\\textbf{x}\\in C_j}\\Vert \\textbf{x}-\\boldsymbol{\\mu_i}\\Vert_{2}^{2}\\\\\n",
    "\\boldsymbol{\\mu}_i=\\frac{1}{|C_i|}\\sum_{\\textbf{x} \\in C_i}\\textbf{x}\n",
    "\\end{equation}\n",
    "\n",
    "其中，$\\boldsymbol{\\mu}_{i}$是簇$C_i$的均值向量。由于最小化上式所示的平方误差需要列举所有可能的簇划分，即$k$的所有可能取值，求解的难度非常大、计算代价太高，$k$均值聚类算法采用贪心策略，通过迭代优化来近似求解。$k$均值聚类算法的流程如下所示：\n",
    "\n",
    "$\\textbf{k均值聚类算法的流程}$\n",
    "\n",
    "$\\textbf{输入：}$\n",
    "\n",
    "样本集$D=\\{\\textbf{x}_1,\\textbf{x}_2,…,\\textbf{x}_m\\}$，聚类个数$k$.\n",
    " \n",
    "$\\textbf{输出：}$\n",
    "\n",
    "样本的簇标记向量$\\boldsymbol{\\lambda}=[\\lambda_{1},\\lambda_{2}...,\\lambda_{m}]$.\n",
    "\n",
    "$\\textbf{过程：}$\n",
    "\n",
    "(1)从训练集$D$中随机挑选$k$个不同的样本作为初始的簇均值向量$\\{\\boldsymbol{\\mu}_1,\\boldsymbol{\\mu}_2,...,\\boldsymbol{\\mu}_k\\}$.\n",
    "\n",
    "(2)初始化所有的簇为空:$C_i=\\phi$.\n",
    "\n",
    "(3)for $i=1:m$:\n",
    "\n",
    "(4)计算样本$\\textbf{x}_i$与各簇均值向量$\\boldsymbol{\\mu}_j$的距离$d_{ji},j=1,2,..,k$.\n",
    "\n",
    "(5)将样本$\\textbf{x}_i$的簇标记设置为距离最小的类簇的标记$\\lambda_{j}$，将样本$\\textbf{x}_i$放入簇$C_j$中.\n",
    "\n",
    "(6)end for\n",
    "\n",
    "(7) for $j=1:k$:\n",
    "\n",
    "(8) 重新计算各个簇的中心$\\boldsymbol{\\mu}_j^{*}=\\frac{1}{|C_j|}\\sum_{\\textbf{x} \\in C_j}\\textbf{x}$.\n",
    "\n",
    "(9) if $\\boldsymbol{\\mu}_j^{*}\\ne \\boldsymbol{\\mu}_j$\n",
    "\n",
    "(10) 更新各个簇的中心$\\boldsymbol{\\mu}_j=\\boldsymbol{\\mu}_j^{*}$.\n",
    "\n",
    "(11) 如果所有簇均值向量均为更新，则停止，否则转入步骤3，继续执行.\n",
    "\n",
    "(12) reutrn $\\boldsymbol{\\lambda}=[\\lambda_{1},\\lambda_{2}...,\\lambda_{m}]$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\textbf{7.3 k-mean聚类算法的实现}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#此处添加代码，构造KMeans类，封装k_mean算法的实现"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#此处添加代码，利用实现的KMeans类，对Iris数据集进行聚类，并可视化聚类结果"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\textbf{7.4 学习向量量化算法}$\n",
    "\n",
    "学习向量量化(Learning Vector Quatization，简称LVQ)试图找到一组原型向量来刻画聚类结构，与一般的聚类算法不同，LVQ假设数据有标记信息，学习过程中使用标记信息来辅助聚类。学习向量量化算法的流程如下所示：\n",
    "\n",
    "$\\textbf{学习向量量化算法流程：}$\n",
    "\n",
    "$\\textbf{输入：}$\n",
    "\n",
    "样本集$D=\\{\\textbf{x}_1,\\textbf{x}_2,…,\\textbf{x}_m\\}$；原型向量个数$q$，各原型向量预设的类别标记$\\{C_1,C_2,…,C_q\\}$；学习速率$\\eta \\in (0,1)$.\n",
    "\n",
    "$\\textbf{输出：}$\n",
    "\n",
    "$q$个原型向量$\\{\\textbf{p}_1,\\textbf{p}_2,…,\\textbf{p}_q\\}$.\n",
    "\n",
    "$\\textbf{过程：}$\n",
    "\n",
    "(1)初始化一组原型向量$\\{ \\textbf{p}_1,\\textbf{p}_2,…,\\textbf{p}_q\\}$.\n",
    "\n",
    "(2) repeat:\n",
    "\n",
    "(3)从样本集$D$中随机选取样本$(\\textbf{x}_j,y_j)$,计算样本与每个原型向量$\\textbf{p}_i$的距离$d_{ji}$.\n",
    "\n",
    "(4)找出与样本$(\\textbf{x}_j,y_j)$距离最近的原型向量$\\textbf{p}^{*} \\in \\{\\textbf{p}_1,\\textbf{p}_2,…,\\textbf{p}_q\\}$.\n",
    "\n",
    "(5)if $y_i$和原型向量$\\textbf{p}^{*}$的类别标记相等:\n",
    "\n",
    "(6)$\\textbf{p}=\\textbf{p}^{*}+\\eta(\\textbf{x}_i-\\textbf{p}^{*})$.\n",
    "\n",
    "(7) else:\n",
    "\n",
    "(8) $\\textbf{p}=\\textbf{p}^{*}-\\eta(\\textbf{x}_i-\\textbf{p}^{*})$.\n",
    "\n",
    "(9) 将原型向量$\\textbf{p}^{*}$更新为$\\textbf{p}$.\n",
    "\n",
    "(10) untill 满足停止条件.\n",
    "\n",
    "(11) return $q$个原型向量$\\{\\textbf{p}_1,\\textbf{p}_2,…,\\textbf{p}_q\\}$.\n",
    "\n",
    "$\\textbf{7.5 学习向量量化算法的实现}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "class LVQ:\n",
    "    #计算两个向量的欧氏距离\n",
    "    def euclidean_dist(self, x1, x2):\n",
    "    #此处添加代码，计算两个向量x1和x2的欧式距离dist\n",
    "    return dist\n",
    "    #选择与样本最近的原型向量的编号\n",
    "    def get_min_dist_proto_vect(self, proto_vects,x):\n",
    "        min_dist = np.inf\n",
    "        min_index = -1\n",
    "        #添加代码，在所有的原型向量proto_vects，中寻找与样本x最近的原型向量的索引min_index\n",
    "        return min_index    \n",
    "    #LVQ训练\n",
    "    def fit(self,data_x,data_y,proto_vects, proto_labels,learnrate,max_iters):\n",
    "        \"\"\"\n",
    "        参数：\n",
    "        data_x: 训练集特征数据\n",
    "        data_y: 训练样本的类别标记\n",
    "        proto_vects: 初始化的原型向量\n",
    "        proto_labels: 原型向量对应的类别标记\n",
    "        learnrate:更新速率\n",
    "        max_iters: 最大迭代次数\n",
    "        \"\"\"\n",
    "        iters = 0\n",
    "        while iters < max_iters:\n",
    "            #随机选择一个训练样本sample_index\n",
    "            sample_index = np.random.choice(len(data_x), 1, replace=False)\n",
    "            min_dist_index = self.get_min_dist_proto_vect(proto_vects, data_x[sample_index])\n",
    "            #根据样本的类别来更新原型向量\n",
    "            if proto_labels[min_dist_index] == data_y[sample_index]:\n",
    "                #添加代码，更新原型向量\n",
    "            else:\n",
    "                #添加代码，更新原型向量\n",
    "            iters += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "生成一个具有三个类簇的数据集，使用以上实现的学习向量量化算法进行原型聚类。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(1)生成一个具有三个类簇的数据集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\utils\\deprecation.py:143: FutureWarning: The sklearn.datasets.samples_generator module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.datasets. Anything that cannot be imported from sklearn.datasets is now part of the private API.\n",
      "  warnings.warn(message, FutureWarning)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<matplotlib.collections.PathCollection at 0x18d09d89520>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAALwAAACCCAYAAAD8OaJ2AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO2dd3hUVfrHP+dOn2RS6SC9CRgbgoogWBAriis2sMtvV9F1m2XV1V3dZW3rspZVdi3LLhZEAVEQ7IgIAgIiUkMJKSSE9GTavff8/pgkJsxMMkkmmUlyP8+TB2buvee8M/c7557yvu8RUkoMDDoLSqwNMDBoSwzBG3QqDMEbdCoMwRt0KgzBG3QqDMEbdCrMsai0S5cusn///rGoulMhpUQIEWsz2pxNmzYVSim7hjoWE8H379+fjRs3xqLqDo+UkqUvfMR///QOFUUVJHdN4ubHr+XCW8+NtWlthhDiYLhjMRG8QevxwcureOX+BXiqvAAU55fywi9fxWKzcN6MCTG2LvYYffgOxn//+E6t2GvwVvn4zyNvx8ii+MIQfAdC0zSK80tDHjty6GgbWxOfGILvQJhMJroelx7yWK/BPdrYmvjEEHwH47a/Xo/Naa33ns1h5fYnZsTIovjCGLR2MM65djwWm5XXHnyD/INH6DWoB7f99XrGXnxqrE2LC1oseCHEccB8oAegA/OklHNbWq5B8xk/bSzjp42NtRlxSTRaeBX4jZTyOyGEC9gkhPhYSvljFMo2MIgqLe7DSynzpJTfVf+/HNgB9G5puQYGrUFU+/BCiP7AycD6aJZr0HJ8Hh9rl26gIKuQoaMHceLEkZ3S7SBqghdCJALvAvdIKctCHJ8FzALo27dvtKo1iICcvXncc9bDeKu8+Dx+LHYLAzP68eTHD2Nz2GJtXpsSlWlJIYSFgNgXSCnfC3WOlHKelHK0lHJ0164h/XoMWok518+ltLAMd4UHTdXwVHjY+90+3n5yaUTXe91eSo6UIqVE13UaioPWNI2v3lvPnBlzmXvHPHZvyozWx4gK0ZilEcArwA4p5d9abpJBNCktLGPf1oNIvb5IfR4/K1/7nBsemV77Xo2QD/yQxbvPfsChXblUVXjI3pULgBAC1a9iMpsYP20ss5+/laQ0V+31mqbx4MVz2L52F54KD4oiWPWfL5jx0M/42W8uxWK1tMEnbphodGnGATOBbUKILdXv/V5KuTwKZXdqyosrmP/oQlYvWofZbGLKredw9X2XY7VFLhxdlxCmq15VVoWUksrSKl685zW+eHstql+tPV3XQ7fkqk9l9aJ17Fi/h39vfxabPbDQ9fXib9n+9U48ld7a631uP68++CYLHn+XqbOncMtfrsNkMkVsf7RpseCllGsI+5UaNBef18/ssQ9QkFWI6lMBeOuJJWz7agdPrHo4ogGnpmls/nQbZqsZv1cNOu6u9LLk+RV88NIqcvYeRvNrAESSuEVTNQ7vL+CShOtJ6ZqMt8qL1+1D1/SQ53vdPpa+8BG6Lvm/p26IoIbWwXAtiFNWv/MNxYdLasUO4HP72LFuN7s27G30eiklj057mmdnvYS73BPyHM2v8c97XidrR06t2JuMhJKCUtwVnrBir8Fb5WPZP1fi8/qbV1cUMAQfp2xfuwt3RbBQdV2ye+O+Rq/f8vkPbPnsh9ruRTjaOhGX1CUVxRVtWmddDMHHKb2H9MDqsAa9bzIrdO/f+CzXtys246kM3bLHEovNQnKXpJjVbwg+Tpl8w0TMlvqDO8Wk4EpLZPQFJwadn3/wCI9c8SQXO69jasoNbF+7E8UUf0MrZ7KDu05/gLefXIK7wk1lWRWVZVVtVr+IRW7J0aNHSyOmtXH2btnPEzc8T87uXKSUHH/GMO7/7110O65LvfMqSyu5aejdlBVV1PajFbOCrjbcp44lFpsFxSRQfSpCCIacOoj75s+m9+CeLS5bCLFJSjk65DFD8PFPaWEZJrOJxJSEkMcX/+ND/v3AAnzu2A0GW4pQBEnpLv63/0XszoZXf30eH163j8SUhJCzVQ0J3ujStAOSuySFFTvAtjU72rXYITCY9bl9fLVoXdhzqsrd/OX6uUxNuZHpPW/npmF3s/WL7U2qxxB8B6C0oDzWJkQFd4WH3MzDYY8/Ou1J1ry3DtWnovpUcvce5qFL55C1MyfiOgzBdwCK80tibULUeG/uhxzYfijo/ew9efy4dnfQAprP4+fdZz+IuHxD8B2A1B4psTYhalSVuXnk8ieD1gcO7y/AbA12DNA13WjhOxN+n5+CrMJYmxFVig4Xk7Uju957A07oi88TPE6x2MyMGjcs4rINwbdzVr+zjpKC0Llo2itCEUHiTu+ZyvkzJ2CrM4OjKAKb08YVd18UcdlG1oJ2zsaPtzbqPtDekLpk4In9gt7/5Uuz6DuiD4v/sZzK0ipOOS+D2+ZcT1qP1IjLNgTfzunaOw2z1YTqa6bzVxyiazoFWUdZvWgtR7IKOWHCSMZdfhpmi5kr77mEK++5pNllGwtP7Zy8/fncfsKv8Vb5Ym1K1LA5rUgJSInP48eRaKfnoO78/avHcCQ6Gr3eWHjqwPQc0J1HFv2WpC6uxk9uJ3jdPnxuX20/3l0ddbXob5FPP4bDEHwH4LQpJ7Mw719ccPMkLE2IhopLBIgQ8UQ+j5/P31zT4uKNPnw7RlM11i//jv3fZ9FrcA/u+PtNuCs8rF70TWRhS3GEyaygaxKTxYSUMmRASqh5+KZiCL6dUl5cwS/HPURh9lHclR5MJgVdl6R2T0EI0eaBHS1F1yVSylrvSQT1frQ2p5WLZ53X4noMwbdT/n3/AvL25deGAGrVrsBFecWxNKvZ1M2qUPNjdSTa0TQNIRROnZzBJf83ucX1GIJvp3y5cG29eNeOhjPJwbUPTMPpcjDizKEMPmlAVMo1BG8Ql6h+jbOnn0HPAd2jWq4xS9NOOXv6mU0exJktJpyuxuex4wGzxUR6z8hXUCPFEHw75dY515HWMzXiuNXxV53OgqyX+M0rv2DmH65qZetajupTWfrCR1Ev1xB8O2P9h5uYPfZ+ru41i8Lso0Ep9OoihKDHgG6cOHEkG5ZvZuaAO/ji7a+5aNZ5PPXZI/QZ1gvF1LYSiLQ+n8fPx/O/jH79US/RoNVY8eqnPHb1s+zakInqU9E1nXCzj0IR/GnpvTiTHPy4dheeykDm4K+XbGD2mPsZPmYIr+2Yy0r/23zke4vJN05s9fTZVruF3kN6YE+MLGNxa/wYDcG3EzRV4+XfzsdbFZlnpNPlwGQxk5eZj7/ObI6u6VSWVfHlwrW175nMJn732p28vPWpkLlwWkpNgLbJbMLvU3Ek2CNaEc6YOCLqthiCbydsWLmFypLI87fcN/8usnflooZYsfRUeNm75UDQ+wNG9ePpzx4ltXtyS0yth81hxea0UVlaibvCw+F9BVSWuekzrCc9BnRrsBXf+93+qNlRgyH4doCUkrm/mBfx+U6XA6vdQp9hvTBbgzP12hNsDBgVelOK48cO4Y1DL3HmZaNbnCI3Kd3FKednoPnV2oUxCOTIzNmdxzOfP8pTnz0SVvQ5u/NaZkAIDMG3Aw7+mE1FcWXE50spcaUlcur5GXTtk14vg5miCOwJdiZdc2bIa3d+u4fr+/2CzZ//gD3BjtkisFg1Auv8kbsrKCaFlO7JeN2+kJmLzVYzezcfYPiYIWG7UQMygoNAWooh+A6GUASpPVIYcspAFEXh2dWPMe6KsZgtJhSTwinnn8hz6/4S0q/cXeHm/smPU5RXgrvcg6fCg+oPiNxia5pvTiCIo5DElARM5uCnjKbqdO/fFavNwjX3TsWeUH8ga3NaufGPVzepzkiIykqrEGIKMBcwAf+WUv41GuXGC1vzD/Pa5k3kVpQzoW8/ZmacTLLd3mb19xvRh6R0V1Aon81pY8xFJ7H+g82YraZq57Fk5qx4sHbGJSndxUNv/QopA85ZihK+jVuz+Ft0PTg9n9+n0Jz+jaIITpw4kvUffoem/jSWMFtN9B/Zh0En9gfgugevxJWWyJtzFlNypIwBo/ry87/dyPFjhzS5zsaIxpY3JuAF4HwgG9gghHi/o+zTumzXTu77dCVeVUUC2/LzWbDtez68biZpDmeb2CCE4A+Lfsu95/0RXdPxuX1YHVZGjhvO7xfcg7fKy471e3GlJjB09KBw6ecanXYsP1qB6oteBjO/V2Xc5WPod3wfnr7lRY7mFSOl5NTJJ3Lv63fWs+2yO6Zw2R1TolZ3OKLRwo8B9kop9wEIId4CpgLtXvB+TePhzz/Bo/7UB/VqKkVuN/M2beD+s85uM1uGjR7EG1kv8eXCbyjOL2HUWcPJmDACIQTmZDOjJwdnFG4qGeO7opj8RKOna0+wcfGs80jvmUp6z1TmZz5PcX4JNqeNhKS2aShCEQ3B9wbqporKBuJi3/P9JcXMXbeWTXk59Eh0ccdpY5nUf2DE1+8tLkKTIR7xusbKzD2c3qcvfZOTGZiaFk2zw5KQ5OSi285ttfIHDv2SM6eU8c1KF56qQL/b7tQwmSWVZSYa69aYrQH/l5SuVq6YfRqTZlxTe0wI0aTsAq1FNAQf6lsIGuG09T6t+4qLmPrWAtyqH11KcsrLmb18GQ9PmMQ1ozIiKiPZZkMN0acFyCot5c7l7yOBU3v24qWLp5JgrT/bUOb18Ny36/hg905MQuHKESO5Y/RYbOY4dVJV93Hvcwf5YkkKH72RhqYJzr+qmBGjy7nnsqFUlplrgzNCuTRknF7Jn9/MRFFUYDUceR6Z+jzCGjKeOiZEY5YmGziuzus+QO6xJ7X1Pq1/X7e2Vuw1uFWVOWtW49ciS2nRy5XECd16YA7R95XV5XlUlW9zsnn0y8/qHfdrGj9b+Cb/3bqZ/MpKcivKmbdpIzcsWRQX0UhSSqRvA9K9DKlWL/BYTkFRbJwzrYQnF+3jmcWZTLmuiF4D/IwcU4nFbuHC287hwTd/hc1Z/8ed3gMefT0TRZSCrKz+K0IW34bUg/apjhnRaGo2AEOEEAOAHOAa4LoolNsiNubl1BN7DZquk1teTr+U8PkYs0pLOFRWys7CQk7r1ZsSj5uc8sBNc6vBc8p+XWfZ7p3MOXcy5upZkFWZe8mtKMdX5wnh1VS2HylgY14Op/XqE9HnkHol+L4FYQbrWISIbOlfaofBvw2UbmDJqDdglVoBsmgm6PkEmms/0noGJP0Fql4FXQWqd/SToJjgwZcPUlSg0mvQV6BsxLXwFB6fuRXVL1H9Gtf9xozVbgaO/X4keFaAs+EpRin94K/e9dRyEoG9rqNPNLatVIUQs4GVBKYlX5VSNi1pdyvQI8HF4YrgzbNUqfPZgUzWZ2czIDWVWaeMJrV6tuVgSQk//3Ape48WUvMMEIDVZOKSIcPYX1rCd3lBD69AubqOX9NqBb8lP48qf/CMh6rpbMvPj0jwunsFlN4PomYeW0DKiwhb+CGSlBJZ/jhUvUXgdkhQekP6fxCmQDCFLPk1aFlAnSed70soHA8Js0DNAm9gm10hAn92h6RXv2xQAzkfTzplDe/sGkN2wWMkdUklxfk/ZMX6EAb5QG84u7H0rkeWzK5jjwIp/0DYQi+OtYQOm4hpVeYe7v7oQ3x1ui+WajH667S6Apg2fATf5Bwir7w87Fqi3WQi3ZlQ29KHwiQEE/sP5PdnTWBN1kH++vXqoCdCgsXKM5OnMHlQw3PMUs1GFl4EHLMxmXAiuq5BKIkhr9OrlkDZA9QTM4B5JEqXxUi9CFkwHmhg+lF0B5nfoH3VhSJSnkHYL0T3rofi24PtxYFIm4+whp5FknoJ8sjZIN3H2OBAdP0coTR9QqChRExxOnpqGX5N47Ut3wX1lf0hBqASeHdn4zOoHk1rUOwAmpR8uj+TLw7sY/KgwajHjBUEkGi1RjRTJD3vEyTaGoO9H4PjitAXVjwX+jp1O7qaixAKjQ7dIhI7gIosfQzp+RY8Cwn+EZlAmJBVC0BxgXCB/3tQutR2s6R7BchQn1OCezkkzIjQlsjokIJfumsH3+fnhxR4W6BJyarMvSGfFn+/4EIskWy9rpcT3B8G0EBvYJ9TvSD8scJzkCiE/EE0F1kIngXhjAFZAZ6l1T9gCEhOB5GETH4aKv4BhEoT6AUZ/azIHcKX5kBJMc+u+5o/ffkZa7IO8t6O7bjV2O55pEkZNGiWwE1LF/PVwQONXi9sZ4MI475gO6uBCxuKWdUJ/Ijaqhsr6/yrV//5AjbIIii5BeTRMNcqYI1+H77dt/BLd+3ggU9WoUodVdd544fv6/Xb4w2vpvLzD5eyasbN9E5qYINe61iwTgDvV0C1H7xwgOMahLmBlBWOK6DqtajaHBsUsJzUGqW2Xyp8Ph74dBUeTa1dIIpnsdfg13Xe3v59g+cIIRApcxEpT4LtArBfjEh5EeG6v+HCE35Bix3Z44LkVgk5bHctvJSSd378gX9u/Ja88jLUOFjEaSqqrpMXYsr0WIRQwD4ZYQ/OuCX1SmTVf8D9IQg7wnk9OC5HKEnI9pZYMhTO1smsENeC96oqT3+zhre3b8OjqpzWqzeDU9N558dteNpBS94Q2aUlZJWW0De56RuSSelDFl0N6kEg4DIsy/4I/vUoyU8gzSNBjflSSMvwfIB03YUQ0ZVoXHdp7ly+jP99v5UKnw9V11mXfYj/btvS7sUOsCE3hwsX/If12cFbNDaKZwVo2dSIPYAb3MuR6n5E0iPRMjN26IeQJb+NerFxIXhN11m+Zzd3rVjGA5+uYuvhPPaXFPP1oSy82k9Tcx3gQV2LTsBN4XeffNRk3xrpXQsyREC3MIFvM8Ia/cFeTPCuRDY0BdsMYt6l0XSdW95/j015uVT5/ShC8P6uHVwyZBgWk4K3/TfmDXKkspKCykq6J4ZeOa1BSomsegsqXwT9SJiTVKSShPR+1QqWxgIJ+lEIs6rcHGIu+I/3ZdaKHUCXEreqsnT3zg4x19AYupTsLCzg1mWL2XGkgCSbjZtOOoXZp52OqU44nqz8F1S8ALjDF4YPSu4kTh7cUUCAqUdUS4z5N7Ny7+6QTlYWRWFgahq2SFYl2ykmIeibnMxty5bw45ECJFDq9TJv0wb+tPrz2vOk9EPlP2lY7LVnE9WV1FjimI4QkWUpi5SYCz7RZkMJE4N595jTuXrkCTGwqvVQgESLFYfZjMNsYX9JCdoxfXi3qrJw+zbKvNWOWHpRaH+Tjo5ehPRtjWqRMRf81SNPwBqiFVeEYNKAQTw68Vx6JESvDxdrBqam8cT5F/DrM8ahE+x+UIPFZCK7rNpZTUkFEfNb1fZ4VyGLZqK7V0WtyJh/i6O6dee+ceOxmUwkWqwkWq0k22y8PvVKsstK+fLAfjK6R7cfF0syi4sYmpbOjiNHQnblavCpKn2qXQ+EsILzZqB95HaPHhLwQPmjyBCxxc0hbvzhi91uvsk+hNNi4aTuPbjrow/YlJeLRVGo9PmIjd9j6yCAXi4XuQ3435uEYGbGyYHGwGwOzNJU/hsq54EsRWJGhPSm7IjYEV0/Qph6RXR2u/CHT3U4uGjIUAB+u2oFG3Jz8GlaUDhBR0ACOeXlDZ6jSclbP3xPQWUFz190acC3JvF2luacxZyvPmNw4gHmjf8Am6IR2uXkmG3w2jUBd+JoEPMuzbH4NY0P9uxqF05grY1HU/l0fyaHKwI/js8P7OPhz1Zw1F3F2oIeLM8aGEbSAqxtlzOndbGCbVLYCK+mEjctfA1+XUeLUeBGPGI1mckqLaW7PZ9u3llsmpqDJgWf5PYnwexFCdm6S/Cta2tTo4wZMIF1DCJ5TlRLjSucFgsDU9PYU1Q/MEARguNcSeRVVnSq1t+rqQxMFsij1zAsqQxFgAnJeb0OUOS1U6WacJpDfR+RbZwQ1yT/GWGbiHS/i1R3gXkUwnFZi1r7uOvSADw26bzagGsIDOASLFbmXXo55/Qf2ClWYAHsZjOXDR1OGitA+uq15laTjsvio0q14NNC3cb23n9Xwb0CeeQ8KP8buBdB+RPIwslIrfl54+NO8FJKXtm8MWgxalBaGoPS0nnx4sv448TWSzcXKyyKgtVkYnBqGlaTiTSHg1+cOoY5504GdTfB2QACvLIrg33lx+7YEf1ta2KCfyvIMn5aYXYHFqPKHm92kXHXpfk+/zBrsrLw1um2aFKy+2gha7IOMqFff0o9HiyKCb/evrs2AhjVrRun9epDst3OFcNH0CcpeLsZ3ZIBnpUc61qgCDizex7DU4oJtF22QMIm2RHmtuzV8a7HPql08K5udqlx18JvyM1BDSHkKr+f9TkB3/E9RUfbvdghcCszi4s5tVdv7hpzRkixAwjH5aAkUP922XA4TmHCiZ9A168RKXMh5QUCrXtsA9hbhgLYwX4ZYdvjFgSFxJ3guzidIV0N7GYz3RISADipR08c8ZqQtIlU+f0sbiQvjlASA2nwlH4ERJAIzusQaf9CCIFi6oqwX4Aw9wvkdGy3KGA7F5H+FkrK42C/ADg25Z61+sfQ7Brii8mDhtRzi63BJASXDT0egGnHjyTBasXUyvuKRoNILGzsc+hlT0PJ3aAfpDbVhpKIODYlh+IidI6X9oII/HAtge0qRdIjYB4IwgnYA/+aj0e47m12DXEneKfFwhvTptPblYTDbMZpsdA9IYHXpl5JqiNwg5NsNpZePYMLBw8lwWIhzeEI+VSINQowoktXRnTtxtn9+uO0BCcIdVosXDUivEeo9O+GqvkEBq01uV08UDHvp6y/NWg5UbQ+FpjAdk7tK6EkI9LfR6T+C5H0e0Tqq4j0hQglodk1xGW/YETXbqy+6Tb2FgU2JBia3iVo1qany8U/Lryk9vV9n6xkyc4fg7KNCcCkKOi63ub+ODqQWVLM0qtnMCQ9nW8OZXH7siVAIHOBogimDhvOpP4N5JnxfkroPrkO3s/AfGvtO9LzKaGnI2u+u3ieqlQg5bmgOXYhBFhPC/xFgbgUPAQ+6JD09IjPv/fM8XyTnUWR202V34/TbMFqNvGvSy9nVNfurMzcw5w1q8mvrIjo9kfLE0VKydeHDjIkPZ0zjuvL2ltn8dHePZT7fJzVtx/D0rs0XICwQsj0eApB/VthI3BLQ+R4DOdopgwAPfobAAdjBqyBz5P8BPi3B4LRpQds4yDx1yimpmdwaI4VHYJ0p5OPZ9zMir17+KEgn0FpaVw6dDiJ1btyHN+lG6XVARWhhGwGuiW6KKisQAImodDF6SS3omEnr9rrFSXkbiEmRcFl+ylqJ8lmZ3pTglpsF0D530Mfs19Q76WwX4SseD7EiYKAa/GxEVM20LMjt6VFmMBxESLpMYQwgX0SuGa3Ud0/EXd9+JZgM5u5fPjxPDRhIteOyqgVO8D87zc36JKgArkV5ahSokmJT9fIqyiPaGB8/sBBPHHe5JCRWwCTBw5u8mepQZj7QNIfCMyxO6pzR9og6bHafO8/nXscJD0aOI6z+s8GSX+tDoSue7trNi9oyRSmFUx9wXkrmBvbVM0L7kXIIxeiVy1rQZ0to0UtvBDiKeBSAlMDmcDNUsqGs9/HiOyy0rDRReGQBCKPFF0Pm4n4gkGDef7CSwMtudXGPSuX1wrfJAQvX3J5vRa+OSjOq5C2SeD9HBBgPyds3nTFeSXSPgm8XxIYBE4MZCOzjUaWPgK+rwJl2M4F78omWmIG1x8CdsjyQFY0x3SE4kQvaCDBa130A1D2ELosRkm4oYn1t5wWBYAIISYDn1XvAvIEgJTyvsaua4sNEY7llc0beeabr+ttQRkJTouFGzJOYnXWQTKLjtLF6WR83/5kdO/BOQMG0u2Y8EOP6mdDTg4mReG0Xr0jS43dhtREDgmhoBeMC5/yIxTmEShdloQ8pB8eDk2ZFhAuRLd1rbK1TasFgEgp6wYbrgN+1pLyWpPpI07g9S2bOVJVGbJrY1EUJAT1wzVd57oTTuTecRMiqsdutjC+X/8oWNw6iLqxsQl3QfmfCDugRSEgYjsICyL5ifAFmwaAlhm5IdIfyDkT5TQcjRHNQestwNtRLC+quGw23r9mBi9v2sCqfXuwmUwkWKxkl5WR7nRyy8mn8vqW78gsLqp9CjjMFqaPHBV2yb89o7tXQcUzDZxhC3R7lDQw9692yw0/iyKS7kcW/5wmpQhpoLzWotEujRDiEyDUz/BBKeXS6nMeBEYD02SYAo/Zp/XUgwcPtsTuVsGj+lmwbSvLdu8iwWJhRsZJTBk0pFXSNscS6d+JPDqdcB6YoIBIRnRd2aDIj0XPPyWw40ej2MF5LUrSAxGX3RQa6tK0OIhbCHEj8HPgXClDJTwMJhZ9eIOf0Et/D+73CN3ndoDtbETSfQhT76aVe3gYYRe+lC6B3fyEGRwzEa5fBaYnW4FW68MLIaYA9wFnRyp2gzhAyyWk2EUiIvkZhH1S88o1DwZ1T4j3hyLS3w/4touEqKfAbgotnYd/HnABHwshtgghXoqCTQatjXU8EGL/KOkDS0azixWuh0KUa0e4HgxkXVCSYyp2aPksTfNXVAxihnBOR1bNB72QnxaeHOC8HmGK3J0jqFzbGZA2H1nxXKClNw9BJN4ddo/WWNBhXAsMIkcoLuiyJJDYyfMJKMkI541gv6jlZVtPQqS9EgUrWwdD8J0UoaQiXL8D1+9ibUqbEpNUe0KII0Bj85JdgMI2MKcx4sGOeLAB4sOOSGzoJ6XsGupATAQfCUKIjeGmljqbHfFgQ7zY0VIbOpS3pIFBYxiCN+hUxLPg58XagGriwY54sAHiw44W2RC3fXgDg9Ygnlt4A4OoE9eCF0I8JYTYKYT4XgixWAjR5v6kQoirhBDbhRC6EKLNZyiEEFOEELuEEHuFEPe3df3VNrwqhCgQQvwQi/qrbThOCPG5EGJH9f34ZXPKiWvBAx8Do6SUGcBuoHX8SRvmB2Aa0PyEhs1EBNwJXwAuBEYA1wohRrS1HcDrwJQY1FsXFfiNlPJ44HTgzuZ8F3EteCnlKillTTjOOqBPDGzYIaXc1db1VjMG2Cul3Cel9AFvAVPb2ggp5WqgqK3rPcaGPCnld9X/Lwd2AE3zXybOBX8MtwArYm1EG9MbOFTndTbNuOuiF5QAAAEDSURBVMkdDSFEf+BkYH1Tr425L00TIqpUYEGsbIgRoUKtOvW0mhAiEXgXuEdKWdbU62MueCnleQ0dr46ouoRARFWr3OzGbIgh2cBxdV73AXJjZEvMEYEUB+8CC6SU7zWnjLju0tSJqLqsk0ZUbQCGCCEGCCGswDXA+zG2KSaIQGDxK8AOKeXfmltOXAueOIioEkJcIYTIBs4APhRCNDV7UbOpHrDPBlYSGKQtlFJub6v6axBCvAl8AwwTQmQLIW5t7JpWYBwwEzinWgtbhBBNduA3VloNOhXx3sIbGEQVQ/AGnQpD8AadCkPwBp0KQ/AGnQpD8AadCkPwBp0KQ/AGnYr/B/F3h16YA0yjAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn.datasets.samples_generator import make_blobs\n",
    "import matplotlib.pyplot as plt\n",
    "from random import randint\n",
    "fig = plt.figure(1)\n",
    "plt.subplot(221)\n",
    "center = [[1, 1], [-1, -1], [1, -1]] #三个类簇的中心\n",
    "cluster_std = 0.35\n",
    "#生成一个具有三个类簇的数据集\n",
    "X1, Y1 = make_blobs(n_samples=1000, centers=center,\n",
    "                        n_features=2, cluster_std=cluster_std, random_state=1)\n",
    "#数据集可视化\n",
    "plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(2) 使用LVQ类，对上面还生成的数据集进行原型聚类。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#添加实现代码"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\textbf{7.6 密度聚类算法基本原理}$\n",
    "\n",
    "密度聚类算法从样本密度的角度来考察样本之间的可连接性，并基于可连接样本不断扩展聚类的簇，以获得最终的聚类结果，密度聚类算法的簇的个数不需要实现设定，而是在聚类的过程中形成的。\n",
    "\n",
    "DBSCAN是一种著名的密度聚类算法，它基于一组邻域参数来刻画样本分布的紧密程度。给定数据集$D=\\{\\textbf{x}_1,\\textbf{x}_2,…,\\textbf{x}_m\\}$，定义以下概念：\n",
    "\n",
    "(1)$\\varepsilon$-$\\textbf{邻域}$:对于$\\textbf{x}_j \\in D$，其$\\varepsilon$-邻域包含样本集中与$\\textbf{x}_j$的距离不大于$\\varepsilon$的所有样本.\n",
    "\n",
    "$N_{\\varepsilon}(\\textbf{x}_j)=\\{\\textbf{x}_i \\in D| dist(\\textbf{x}_j,\\textbf{x}_{i}) \\le \\varepsilon\\}$\n",
    "\n",
    "(2)$\\textbf{核心对象：}$如果样本$\\textbf{x}_j$的$\\varepsilon$-邻域至少包含$MinPts$个样本，则样本$\\textbf{x}_j$是一个核心对象.\n",
    "\n",
    "(3)$\\textbf{密度直达：}$若$\\textbf{x}_j$位于$\\textbf{x}_i$的$\\varepsilon$-邻域中，且$\\textbf{x}_i$是核心对象，则称$\\textbf{x}_j$通过$\\textbf{x}_i$密度直达.\n",
    "\n",
    "(4)$\\textbf{密度可达：}$对$\\textbf{x}_i$与$\\textbf{x}_j$，若存在样本序列$\\textbf{p}_1,\\textbf{p}_2,…,\\textbf{p}_n，\\textbf{p}_1=\\textbf{x}_i，pn =\\textbf{x}_j$，且$\\textbf{p}_{i+1}$由$\\textbf{p}_i$密度直达，则称$\\textbf{x}_j$由$\\textbf{x}_i$密度可达.\n",
    "\n",
    "(5)$\\textbf{密度相连：}$ 对$\\textbf{x}_i$与$\\textbf{x}_j$，若存在$\\textbf{x}_k$使得$\\textbf{x}_i$与$\\textbf{x}_j$均由$\\textbf{x}_k$密度可达，则称$\\textbf{x}_i$与$\\textbf{x}_j$密度相连.\n",
    "\n",
    "下图展示了密度聚类算法中定义的基本概念，其中黑色的点表示数据样本，红色的点表示核心对象，带箭头的实现表示样本之间密度可达。\n",
    "\n",
    "<img src=picture/cluster2.png>\n",
    "\n",
    "DBSCAN算法首先在数据集中任选一个核心对象为种子，再由此出发找出其密度可达的样本，生成相应的聚类簇，直到所有核心对象都被访问过为止，其算法流程如下所示：\n",
    "\n",
    "<img src=picture/cluster1.png>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\textbf{7.7 密度聚类算法实现}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "class DBSCAN:\n",
    "    #计算两个向量的欧氏距离\n",
    "    def euclidean_dist(self, x1, x2):\n",
    "        x = np.array(x1)\n",
    "        y = np.array(x2)\n",
    "        return np.linalg.norm(x - y)\n",
    "    #密度聚类\n",
    "    def fit(self,data,neighbor_dist,minpts):\n",
    "        \"\"\"\n",
    "        参数：\n",
    "        data: 数据集\n",
    "        neighbor_dist: 邻域半径\n",
    "        minpts:构成核心对象的邻域最小样本数\n",
    "        \"\"\"\n",
    "        #初始化核心对象集合为空\n",
    "        core_obj = set()\n",
    "        #初始化聚类个数为0\n",
    "        cluster_num = 0\n",
    "        #初始化聚类列表clusters\n",
    "        clusters = []\n",
    "        #初始化为访问样本集合P\n",
    "        P = set(data)\n",
    "        #搜索核心对象\n",
    "        for sample in data:\n",
    "            #如果样本sample的neighbor_dist邻域中样本数大于等于minpts\n",
    "            if len([ sample_i for sample_i in data if self.euclidean_dist(sample, sample_i) <= neighbor_dist]) >= minpts:\n",
    "                core_obj.add(sample)\n",
    "        #开始聚类\n",
    "        while len(core_obj):\n",
    "            P_OLD = P #记录当前未访问的样本集合\n",
    "            #从核心对象集合中随机选取一个核心对象\n",
    "            obj = list(core_obj)[np.random.randint(0, len(core_obj))]\n",
    "            #从未访问样本集合中删除核心对象obj\n",
    "            P = P - set(obj)\n",
    "            #初始化队列Q\n",
    "            Q = []\n",
    "            Q.append(obj)\n",
    "            while len(Q):\n",
    "                q = Q[0]#取队首元素q\n",
    "                #获取q的neighbor_dist邻域\n",
    "                N_q = [sample for sample in data if self.euclidean_dist(sample, q) <= neighbor_dist]\n",
    "                #如果领域中的样本数大于等于minpts\n",
    "                if len(N_q) >= minpts:\n",
    "                    delta = set(N_q) & P\n",
    "                    Q += (list(delta))#将delta加入队列\n",
    "                    P = P - delta #从未访问样本集合中删除delta\n",
    "                Q.remove(q)#删除队首元素\n",
    "            #生成类簇C_k\n",
    "            cluster_num = cluster_num + 1\n",
    "            C_k = list(P_OLD - P)\n",
    "            core_obj = core_obj - set(C_k)\n",
    "            clusters.append(C_k)\n",
    "        return clusters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "#数据集：每三个是一组分别是西瓜的编号，密度，含糖量\n",
    "data = [(0.697,0.460),(0.774,0.376),(0.634,0.264),(0.608,0.318),(0.556,0.215),\n",
    "        (0.403,0.237),(0.481,0.149),(0.437,0.211),(0.666,0.091),(0.243,0.267),\n",
    "        (0.245,0.057),(0.343,0.099),(0.639,0.161),(0.657,0.198),(0.360,0.370),\n",
    "        (0.593,0.042),(0.719,0.103),(0.359,0.188),(0.339,0.241),(0.282,0.257),\n",
    "        (0.748,0.232),(0.714,0.346),(0.483,0.312),(0.478,0.437),(0.525,0.369),\n",
    "        (0.751,0.489),(0.532,0.472),(0.473,0.376),(0.725,0.445),(0.446,0.459)]\n",
    "#绘制聚类结果\n",
    "def draw(C):\n",
    "    colValue = ['r', 'y', 'g', 'b', 'c', 'k', 'm']\n",
    "    for i in range(len(C)):\n",
    "        coo_X = []    #x坐标列表\n",
    "        coo_Y = []    #y坐标列表\n",
    "        for j in range(len(C[i])):\n",
    "            coo_X.append(C[i][j][0])\n",
    "            coo_Y.append(C[i][j][1])\n",
    "        plt.scatter(coo_X, coo_Y, marker='x', color=colValue[i%len(colValue)], label=i)\n",
    "\n",
    "    plt.legend(loc='upper right')\n",
    "    plt.show()\n",
    "#此处添加代码，使用上面实现的DBSCAN类对data数据进行密度聚类，并适应draw函数可视化聚类结果"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\textbf{7.8 层次聚类}$\n",
    "\n",
    "层次聚类是在不同层次对数据集进行划分，从而产生树形的聚类结构。AGNES是一种自底向上聚合的层次聚类算法，它先将数据集的每个样本看成一个初始的聚类簇，然后在算法的每一步中找出距离最近的两个类簇进行合并，该过程不断重复，直到达到预设的聚类簇个数。层次聚类的聚类过程示意图如下所示。\n",
    "\n",
    "<img src=picture/cluster3.png>\n",
    "\n",
    "AGNES算法在对两个类簇进行合并的时候，需要计算两个类簇的距离，而每个类簇是一个集合，因此需要采用关于集合的距离计算方法。给定任意两个聚类簇$C_i$和$C_j$，它们之间的距离可以采用以下方式计算：\n",
    "\n",
    "(1)$\\textbf{最小距离：}$\n",
    "\n",
    "$d_{min}(C_i,C_j)=\\underset{x\\in C_i,y \\in C_j}\\min dist(x,y)$.\n",
    "\n",
    "(2)$\\textbf{最大距离：}$\n",
    "\n",
    "$d_{max}(C_i,C_j)=\\underset{x\\in C_i,y \\in C_j}\\max dist(x,y)$.\n",
    "\n",
    "(3)$\\textbf{平均距离：}$\n",
    "\n",
    "$d_{avg}(C_i,C_j)=\\frac{1}{|C_i|}\\frac{1}{C_j}\\sum_{x\\in C_i}\\sum_{y\\in C_j}dist(x,y)$.\n",
    "\n",
    "$\\textbf{AGNES算法的流程如下所示：}$\n",
    "\n",
    "<img src=picture/cluster4.png>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\textbf{7.9 层次聚类算法的实现}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "class AGNES:\n",
    "    #计算两个向量的欧氏距离\n",
    "    def euclidean_dist(self, x1, x2):\n",
    "        x = np.array(x1)\n",
    "        y = np.array(x2)\n",
    "        return np.linalg.norm(x - y)\n",
    "    #计算两个集合的最小距离   \n",
    "    def dist_min(self,C_i,C_j):\n",
    "        min_dist = np.inf\n",
    "        #添加代码，计算集合C_i和C_j的最小距离min_dist\n",
    "        return min_dist\n",
    "    #计算两个集合的最大距离\n",
    "    def dist_max(self,C_i,C_j):\n",
    "        max_dist = -np.inf\n",
    "         #添加代码，计算集合C_i和C_j的最大距离max_dist\n",
    "        return max_dist\n",
    "    #计算两个集合的平均距离\n",
    "    def dist_avg(self,C_i,C_j):\n",
    "        avg_dist = 0.0\n",
    "         #添加代码，计算集合C_i和C_j的平均距离avg_dist\n",
    "        return avg_dist\n",
    "    #寻找最小距离及其下标\n",
    "    def find_min_dist_pos(self,dist_matrix):\n",
    "        min_dist =  np.inf\n",
    "        x = 0\n",
    "        y = 0\n",
    "        for i in range(len(dist_matrix)):\n",
    "            for j in range(len(dist_matrix[i])):\n",
    "                if i!=j and dist_matrix[i][j] < min_dist:\n",
    "                    min_dist = dist_matrix[i][j]\n",
    "                    x = i\n",
    "                    y = j\n",
    "        return (x,y,min_dist)\n",
    "    def cal_dist(self, dist_type,x,y):\n",
    "        dist = 0.0\n",
    "        if dist_type == 'min':\n",
    "            dist = self.dist_min(x,y)\n",
    "        elif dist_type == 'max':\n",
    "            dist = self.dist_max(x,y)\n",
    "        else:\n",
    "            dist = self.dist_avg(x,y)\n",
    "        return dist\n",
    "    #层次聚类\n",
    "    def fit(self,data,dist_type, k):\n",
    "        \"\"\"        \n",
    "        参数：\n",
    "        data: 数据集\n",
    "        dist_type: 'min','max','avg'之一\n",
    "        k: 聚类个数\n",
    "        \"\"\"\n",
    "        #初始化聚类划分clusters和距离矩阵dist_matrix\n",
    "        clusters = []\n",
    "        dist_matrix = []\n",
    "        #初始时，每个样本是一个簇\n",
    "        for sample in data:\n",
    "            C_i = []\n",
    "            C_i.append(sample)\n",
    "            clusters.append(C_i)\n",
    "        for x in clusters:\n",
    "            m_i = []\n",
    "            for y in clusters:\n",
    "                dist = self.cal_dist(dist_type, x, y)\n",
    "                m_i.append(dist)\n",
    "            dist_matrix.append(m_i)\n",
    "        #设置当前聚类簇个数\n",
    "        q = len(data)\n",
    "        #合并、更新\n",
    "        while q > k:\n",
    "            x,y,min_dist = self.find_min_dist_pos(dist_matrix)\n",
    "            #合并两个集合C_x,C_y\n",
    "            clusters[x].extend(clusters[y])\n",
    "            #删除集合C_y,因为它已经合并\n",
    "            clusters.remove(clusters[y])\n",
    "            #更新距离矩阵\n",
    "            dist_matrix = []\n",
    "            for x in clusters:\n",
    "                m_i = []\n",
    "                for y in clusters:\n",
    "                    dist = self.cal_dist(dist_type, x, y)\n",
    "                    m_i.append(dist)\n",
    "                dist_matrix.append(m_i)\n",
    "            q = q - 1\n",
    "        return clusters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXsAAAD4CAYAAAANbUbJAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAbJklEQVR4nO3df2wc533n8ffXWitCHbnM6cclIqWKCe1Yls8WLFqKASNSDjFk+65R/AOonOCC1EENF3Fvj0YPMQ44Z+/SA9I/Unpxds8QGiNA/rBQoG3ia2Xp4BpyDsmdIrpRUluJLjxTDrn6Q46Sbew2PGvF7/2xO+TsckkOl7M7OzOfF0BISw53n9kfHz7znWeex9wdERHJtmuSboCIiHSfwl5EJAcU9iIiOaCwFxHJAYW9iEgOFJJ64M2bN/vOnTuTengRkVR67bXXfu7uW1b7e4mF/c6dO5mYmEjq4UVEUsnM3urk9yKVcczsHjM7b2aTZvZkm58fNLN/MLOzja+nOmmMiIh0x4o9ezNbBzwL3A3MAGfM7EV3P9ey6f9093/dhTaKiMgaRenZ7wMm3f1Nd38POAYc7m6zREQkTlFq9oPAdOj2DLC/zXZ3mtkPgYvAH7r7G60bmNmjwKMAO3bsWH1rRURicOXKFWZmZpidnU26KUvasGEDQ0NDXHvttbHcX5Swtzbfa51Q5++A33L3d83sPuBbwA2Lfsn9KHAUYHR0VJPyiEgiZmZm2LhxIzt37sSsXcRF4A7h3229vQbuzuXLl5mZmWF4eDiW+4xSxpkBtoduD1HvvYcb9it3f7fx/+PAtWa2OZYWiojEbHZ2lk2bNnUe9BcvwvR0PeCh/u/0dP37MTAzNm3aFOuRR5SwPwPcYGbDZrYeOAK82NKwD1rjWTOzfY37vRxbK0VEYramHn2tBpcuLQT+9HT9dq228AcgqfYtYcUyjrvXzOxx4CSwDnje3d8ws8caP38OeAj4fTOrAb8GjrjmThaRLDKD7Y1ix6VL9S+ArVvr3485pOMSaZy9ux939xvd/SPu/l8a33uuEfS4+zPuvtvdb3P3j7n797rZaBGRRIUDP7DKoD9x4gQf/ehHGRkZ4atf/WrMDVxMc+OIiKyktVAxN1cv3YSFa/gruHr1Kl/84hd56aWXOHfuHC+88ALnzrVeuhQvhb2IyHJKJRgbWwjyuTn4vd+DP/qjeulm7976v+Ea/gq+//3vMzIywoc//GHWr1/PkSNH+Pa3v93V3VDYi4gsxR2qVSiXFwL/iSfg+efh6lUYGloo6WzdCoVCpFJOpVJhe6gMNDQ0RKVS6eKOJDgRmkhWdHG4tSTNDMbH6/8vl+tfAMUi/MmfwDXXLGy3ipp9u/ErcY++aaWevcgatB7hu9dvl0pJtkpiFQ78wPj4QtCHt4toaGiI6VDNf2Zmhm3btq2llStS2It0qN0R/thY/Xa1Gttwa0la8MKGhf/Cd+COO+7gpz/9KVNTU7z33nscO3aMT33qU2ts6PJUxhHp0HJH+OPjKuVkQvgvePDCBreh4xe6UCjwzDPPcOjQIa5evcojjzzC7t27Y258y2N29d5FMi4I/OCzDwr6TDGDgYHmv+DBX/iBgTW90Pfddx/33XdfLM2MQmEvsgZLHeEr8HukF2fHS6Xm+w0CP2UvsGr2kqjWsmea6tytR/hzc/V/wzV86aJenh1vDfaUBT0o7CVBaR/JstQRfrG45iN8WYnOjq+ayjiSiPBnFZrPexWL6RmrnpEj/PTR2fFVU9hLIrL0Wc3AEX466ez4qqiMI4lZ6loVfVYlki6Mf88yhb0kRp9V6VgGzo4/8sgjbN26lVtuuaUnj6ewl0Rk4LMqSYrr7HjE4WCtc9nEsTbT5z//eU6cOLHm+4lKNXtJRBevVZG8WOvZ8YsX68sIBhOYBcsLFgoQmqdmaqpErVZlZGQcM8PdmZwco1AYYHi41HHzP/7xj3PhwoWOf3+1FPaSGI1kkTVby9nxYB1ZqAd+sI7s1q3zb0x3p1arUqnUTwKPjIwzOTlGpVJmcLCIu3d9tsq4KOwlUXGNZNE0w7JqEdaRNTNGRuqHnJVKeT70BweL8z39tFDNXlIv7RdnSUIiriMbDvxA2oIeFPaScrqQUjoW1OjD2iwrGNTowyYnx2I5SdtLCntJtfAgjHK5vp5EeDbalHW+pJfCNfol1pENgj6o0R84MMfgYJFKpbzmwH/44Ye58847OX/+PENDQ3z961+Pa8/aUs1eUk8XUkpHCoXmGn1Q0gmtI2tmFAoDTTX6oKRTKAysqZTzwgsvrHkXVkNhL6mnaYalI9u2LR4O1qZmPzxcahp1EwS+avYiPaSLs2RNIg4Haw32tAU9qGcvKaeLs0SiUdhL6uniLJGVqYwjmaBphkWWp7AXEckBhb2ISI9NT0/ziU98gl27drF7927K4XHDXaKavYjIClonPFvrBGiFQoGvfe1r3H777bzzzjvs3buXu+++m5tvvjmO5ralnr2IyDJKp0qMnVy4WtbdGTs5RulUqeP7/NCHPsTtt98OwMaNG9m1axeVSiWO5i5JYS8isgR3pzpbpXy6PB/4YyfHKJ8uU52txjI/zoULF/jBD37A/v37Y2jx0lTGERFZgpkxfqh+4Ub5dJny6Xptvbi/yPihtV9F++677/Lggw/y9NNPc/3116+5vctRz15EZBnhwA/EEfRXrlzhwQcf5LOf/SwPPPDAmu4rikhhb2b3mNl5M5s0syeX2e4OM7tqZg/F10QRkeQEpZuwcA2/0/v8whe+wK5du3jiiSfW2sRIVgx7M1sHPAvcC9wMPGxmi04ZN7b7Y+Bk3I0UEUlCuEZf3F9k7qk5ivuLTTX8Tnz3u9/lm9/8Jq+88gp79uxhz549HD9+PObWN4tSs98HTLr7mwBmdgw4DJxr2e4PgL8A7oi1hSIiCTEzBjYMNNXog5LOwIbOpzi+6667er74SZSwHwTCy7nMAE2njc1sELgf+Jco7EUkQ0oHF09xHEfNvtei1Ozb7VHrn6SngS+5+9Vl78jsUTObMLOJt99+O2ITRUSSlZcpjmeA8Kq8Q8DFlm1GgWONJ2AzcJ+Z1dz9W+GN3P0ocBRgdHRUM42LiPRIlLA/A9xgZsNABTgCfCa8gbsPB/83s28Af90a9CIikpwVw97da2b2OPVRNuuA5939DTN7rPHz57rcRhERWaNIV9C6+3HgeMv32oa8u39+7c0SEZE46QpaEZEem52dZd++fdx2223s3r2bL3/5y11/TM2NI6kVXoqw3W2RuMT2Xmv84vve9z5eeeUV3n/ddVyp1bjrrru49957+djHPhZbm1upZy+pVCrB2Fj9swP1f8fG6t8XiVNs77WLF2F6Ghpj9t9/3XUwPc2Vn/2MK1eudH04p8JeUscdqlUolxc+hGNj9dvV6sKHUmStYnuvuUOtBpcuzQf+1QsX2PPJT7L11lu5+5Of1BTH0l6eSxhmMN6YhLBcrn8BFIv173f7ecjzc583sb3XzGB743KlS5fg0iXWAWdffpnqxo3c/8ADvP7669xyyy1x78I89exTSCWM5g9hoBdBr+c+f2J7r4UDP7B9OwMf+AAHDx7kxIkTa2rnShT2KaMSRl2w32HhEO7WY+q5z5/Y3mvu9RIO8PYvf0n1nXdgeppf/9M/8fLLL3PTTTfF0+ClH98T+dq7d69LZ+bm3ItF9/q7p/5VLNa/nwfh/Q/2u/V2Lx47j899Vpw7dy7SdrG91+bm3N96y/3MGfe33vIfnj3re26+2f/FyIjvvvFG/0+lUuR2AhPeQeaaJ9QdGR0d9YmJiUQeOwvc4ZrQcdncXL7qxqVSvTcdHE4Hva+Bge6XVPL+3GfBj3/8Y3bt2hVp29jeaxcv1k/Sbt++cEfT01AowLZtkdtpZq+5++gqHhnQCdpUWuqwshc1635RKjWfGA3qqr04OZv35z5vYnuvbdu2+I6C4O8B1exTJlwnLhbrvcpisbmOnBetn5FeBb2e+/yJ7b3W6zdtiHr2KWNWP3wMD/0KRgoMDKh32U167rPFQwuS9KO4S+yq2aeUxnonR899+k1NTbFx40Y2bdrUl4Hv7ly+fJl33nmH4eHhpp+pZp8zCR4N5p6e+/QbGhpiZmaGfl4xb8OGDQwNDcV2fwp7Ecm2Nodi11577aIec9bpBK2IZJcueZ6nsBeRbNIlz01UxhGRbEp6xrw+o9E4IpJtGbvkudPROCrjiEh2JTFjXp9S2ItINumS5yaq2YtINumS5yaq2YtItmXskmfV7EVE2tElz4DCXkQkFxT2IiI5oLAXEckBhb1IzrQOykhqkIb0lsJeJEempkpMTo7NB7y7Mzk5xtRUKdmGSdcp7EVywt2p1apUKuX5wJ+cHKNSKVOrVdXDzzhdVCWZk7Fh1bExM0ZG6hcVVSplKpX6xGCDg0VGRsb7csUmiY969pIpmr58eeHADyjo80FhL5mh6ctXFpRuwsI1fMkulXEkMzR9+fLCNfqgdBPcBvXws049e8mUcOAHFPR1ZkahMNBUox8ZGWdwsEihMKCgzzj17CVTlpq+XIFfNzxcwt3ngz0IfAV99kXq2ZvZPWZ23swmzezJNj8/bGY/MrOzZjZhZnfF31SR5Wn68mhag11Bnw8r9uzNbB3wLHA3MAOcMbMX3f1caLO/BV50dzezW4E/B27qRoNFlqLpy0WWFqWMsw+YdPc3AczsGHAYmA97d383tP11gPpQkohSqXlcfRD4CnrJuyhlnEFgOnR7pvG9JmZ2v5n9BPgb4JF2d2RmjzbKPBNvv/12J+0VWZGmLxdZLErYt/uoLOq5u/tfuftNwKeBr7S7I3c/6u6j7j66ZcuWVTVUREQ6FyXsZ4DtodtDwMWlNnb37wAfMbPNa2ybiIjEJErYnwFuMLNhM1sPHAFeDG9gZiPWOKVvZrcD64HLcTdWREQ6s+IJWnevmdnjwElgHfC8u79hZo81fv4c8CDwOTO7Avwa+B3X9dciIn3Dksrk0dFRn5iYSOSxRUTSysxec/fR1f6epksQEckBhb2ISA4o7EVEckBhLyKSA6kK+9aTyRrwIyISTWrCvnSqxNjJhRV13J2xk2OUTpWSbZiISAqkIuzdnepslfLp8nzgj50co3y6THW2qh6+SI/o6Dq9UrF4iZkxfqg+V235dJny6foyasX9RcYPaeEFkV6YmipRq1XnFzsJljksFAYYHi4l3TxZQSp69tAc+AEFvUhvuDu1WpVKpTy/QHmwfm2tpqPrNEhN2Aelm7BwDT+O+1/utkiehderrVTKvPrqNU0Ll6vT1f9SEfbhGn1xf5G5p+Yo7i821fDXQid/RVYWBH6Ygj49UhH2ZsbAhoGmGv34oXGK+4sMbBhY05tNJ39FoglKN2FBSUf6X6omQnP3pmBvvd2pcMAHdPJXZEG4Rh+Ublpv67PSG7mYCK31zRTXm0snf0WWZ2YUCgNNwR7U8AuFtR1dS2+kYuhlty118leBL7JgeLjUdDQdBL4+I+mQqp59N3T75K9IlnTr6Fq6L/c9+6VO/gJrPvm7Gt06HyEiAik7QdtNSYZt6VSJ6mx1/o9NcLQxsGGA0sFST9ogIumQixO03ZTU4Wk/D/3UhWYi2ZH7Mk7S+nXeHx1tiGSLevZ9oN+Gfvbz0YZI0tJ6xKuefR/ot6Gf/Xq0IZK0NM/8qZ59wvp16Ge/HW2IJC3tM3+qZ5+wfhn62arfjjZEkhaeCK5SKVOp1I940zJdhIZe9ol+GmfferQxfmh80e1+f2OLdIu78+qrC0WRAwfmevp56HTopXr2faKfrkzs16MNkaQtNfOnevbLUM++//XT0YZI0vpl5k/17CV2/XS0kUX6Y5ouS838CaRi5k+FvUgC0jyEL8/SPPOnhl5Kz6T1YpS4pX0IX96l9YhXPXvpCU2/sCDtQ/gkndSzl67T9AuLafFu6TX17KXrNP3CYmkewifppJ699ISmX1jQOoTvwIE5BgeLTTV8kbgp7KUnlpp+IY/BpsW7JQmRwt7M7jGz82Y2aWZPtvn5Z83sR42v75nZbfE3VdKqXyd7S9LwcKmpZBMEfhaHXWoUVn9YsWZvZuuAZ4G7gRngjJm96O7nQptNAQfc/Zdmdi9wFNjfjQZL+mj6hfbSOoRvNXQ9Qf+IcoJ2HzDp7m8CmNkx4DAwH/bu/r3Q9v8bGIqzkZJ+pYOLL0bJa80+L8LXEwCLphfQFcO9FSXsB4Hp0O0Zlu+1fwF4qd0PzOxR4FGAHTt2RGyiZEUeerKyQNcT9JcoNft2r0jbopuZfYJ62H+p3c/d/ai7j7r76JYtW6K3UhZRHVTSQNcT9I8oYT8DbA/dHgIutm5kZrcCfwYcdvfL8TRP2imdKjWd2AxOgJZOlZJtmEiLpa4nUOek96KE/RngBjMbNrP1wBHgxfAGZrYD+Evg37j7/4m/mRLQ1aiSFrqeoL+sWLN395qZPQ6cBNYBz7v7G2b2WOPnzwFPAZuAP20cntU6mW9ZVqarUSUt0j4lcNZo8ZKUcneu+c8LB2ZzT/V2aTSRqDRvf7w6XbxEV9CmkK5GlTTRKKz+oLBPGV2NKiKd0KyXKaOrUUWkE6rZp5TqoCL5pJp9zqgOKiKrobAXEckBhb2ISA4o7EVk1TQ3U/oo7EVkVaamSk3THQTTIkxNlZJtmCxLYS8ikYXnqA8CP5j/plbT3Ez9TOPsRSQyzVGfXurZi8iqaI76dFLYi8iqaI76dFLYS6ZolEh3aY769FLNXjJjaqpErVadLykEwVQoDDA8XEq6eZmgOerTS2EvmRAeJQL1GnK4B6q5g+IzPFxqej6DwNfz298U9pIJGiXSW5qbKX1Us5fM0CgRkaUp7CUzNEpEZGkKe8kEjRIRWZ5q9pIJGiUisjyFvWSGRomILE1lHMkUjRIRaU9hLyKSAwp7EZEcUNiLiOSAwl5EJAcU9iIiOaCwFxHJAYW9iEgOKOxFRHJAYS8ikgMKexGRHFDYi0hu5HmNYoW9iOTC1FSpabrrYFrsqalSsg3rkUhhb2b3mNl5M5s0syfb/PwmM/tfZvb/zOwP429mvuS59yHSDeE1ioPAD9Y/qNWqufiMrTjFsZmtA54F7gZmgDNm9qK7nwtt9gvg3wKf7kYj86R0qkR1tsr4ofrUvO7O2MkxBjYMUDpYSrp5IqmkNYqj9ez3AZPu/qa7vwccAw6HN3D3S+5+BrjShTbmhrtTna1SPl1m7OTYfNCXT5epzuaj9yHSLXlfozjK4iWDwHTo9gywv5MHM7NHgUcBduzY0cldZJqZMX6o/mYsny5TPl3vfRT3F+d7+iJZEF5kpt3tbj1muzWK8xL4UXr27Z6FjrqY7n7U3UfdfXTLli2d3EXmhQM/oKCXLEniRKnWKI4W9jPA9tDtIeBid5ojQekmLCjpiKRdUidKl1qjeHCwmJs1iqOUcc4AN5jZMFABjgCf6Wqrcipcow9KN8FtUA9f0i/JE6V5X6N4xbB395qZPQ6cBNYBz7v7G2b2WOPnz5nZB4EJ4Hpgzsz+HXCzu/+qe03PHjNjYMNAU40+KOkMbMhH70OyLwjZIOihdydK87xGsSVVHhgdHfWJiYlEHrvfJXHySqRXwqWbQJ6GQK6Vmb3m7qOr/T1dQduH8tz7kGzTidLkRKnZi4jEYqkTpUBuTpQmRWEvIj2V9xOlSVEZR0R6TqXK3lPYi4jkgMJeRCQHFPYiIjmgsBcRyQGFvYhIDqQy7LWSk4jI6qQu7EunSk2zQAaTh5VOlZJtmIhIH0tV2GslJxGRzqTqClqt5CQi0plU9exBKzmJiHQidWGvlZxERFYvVWHfupLT3FNzFPcXm2r4IiKyWOpq9lrJSWRpWvhGlpLKlar0hhZZbGqqRK1WnZ8uOFgopFAYYHi4lHTzJCa5WqlK06OKNHN3arVq04pPwYpQtZqGJUvKyjgi0l54xadKpTy/vqvWdpVAKnv2IrJYOPADCnoJKOxFMiIo3YRpEW8JKOxFMiBcox8cLHLgwByDg8WmGr7km2r2IhlgZhQKA001+qCkUyhoWLIo7EUyY3i41DQMOQh8Bb2AyjgimaJhybIUhb2ISA4o7EVEckBhLyKSAwp7EZEcSGwiNDN7G3grkQePx2bg50k3osvysI+Qj/3UPmbHR91942p/KbGhl+6+JanHjoOZTXQy81ya5GEfIR/7qX3MDjPraLpglXFERHJAYS8ikgMK+84dTboBPZCHfYR87Kf2MTs62s/ETtCKiEjvqGcvIpIDCnsRkRxQ2K/AzO4xs/NmNmlmT7b5+WEz+5GZnTWzCTO7K4l2rsVK+xja7g4zu2pmD/WyfXGI8DoeNLN/aLyOZ83sqSTauVZRXsvGvp41szfM7NVet3GtIryW/z70Or7eeM/+syTa2qkI+/ibZvbfzeyHjdfxd1e8U3fX1xJfwDrg/wIfBtYDPwRubtnm/Syc+7gV+EnS7Y57H0PbvQIcBx5Kut1deB0PAn+ddFt7sJ8DwDlgR+P21qTbHfc+tmz/28ArSbe7C6/jfwD+uPH/LcAvgPXL3a969svbB0y6+5vu/h5wDDgc3sDd3/XGMw5cB6TtjPeK+9jwB8BfAJd62biYRN3HtIuyn58B/tLdfwbg7ml7PVf7Wj4MvNCTlsUnyj46sNHqc1i/n3rY15a7U4X98gaB6dDtmcb3mpjZ/Wb2E+BvgEd61La4rLiPZjYI3A8818N2xSnS6wjc2TgsfsnMdvemabGKsp83Ah8ws1Nm9pqZfa5nrYtH1NcSM/sN4B7qnZQ0ibKPzwC7gIvA3wNFd59b7k4V9strt/LDop67u/+Vu98EfBr4SrcbFbMo+/g08CV3v9r95nRFlH38O+C33P024L8C3+p2o7ogyn4WgL3AvwIOAf/RzG7sdsNiFOkz2fDbwHfd/RddbE83RNnHQ8BZYBuwB3jGzK5f7k4V9subAbaHbg9R/0valrt/B/iImW3udsNiFGUfR4FjZnYBeAj4UzP7dE9aF48V99Hdf+Xu7zb+fxy4NmWvI0R7LWeAE+7+j+7+c+A7wG09al8cVvOZPEL6SjgQbR9/l3o5zt19EpgCblr2XpM+GdHPX9R7QW8CwyycKNndss0ICydobwcqwe00fEXZx5btv0H6TtBGeR0/GHod9wE/S9PruIr93AX8bWPb3wBeB25Juu1x7mNju9+kXse+Luk2d+l1/G9AqfH/f97Inc3L3a8WHF+Gu9fM7HHgJPUz5M+7+xtm9ljj588BDwKfM7MrwK+B3/HGK5AGEfcx1SLu40PA75tZjfrreCRNryNE2093/7GZnQB+BMwBf+buryfX6tVZxfv1fuB/uPs/JtTUjkXcx68A3zCzv6de9vmS14/UlqTpEkREckA1exGRHFDYi4jkgMJeRCQHFPYiIjmgsBcRyQGFvYhIDijsRURy4P8DK9CVBB8Dg2wAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "#数据集：每三个是一组分别是西瓜的编号，密度，含糖量\n",
    "data = [(0.697,0.460),(0.774,0.376),(0.634,0.264),(0.608,0.318),(0.556,0.215),\n",
    "        (0.403,0.237),(0.481,0.149),(0.437,0.211),(0.666,0.091),(0.243,0.267),\n",
    "        (0.245,0.057),(0.343,0.099),(0.639,0.161),(0.657,0.198),(0.360,0.370),\n",
    "        (0.593,0.042),(0.719,0.103),(0.359,0.188),(0.339,0.241),(0.282,0.257),\n",
    "        (0.748,0.232),(0.714,0.346),(0.483,0.312),(0.478,0.437),(0.525,0.369),\n",
    "        (0.751,0.489),(0.532,0.472),(0.473,0.376),(0.725,0.445),(0.446,0.459)]\n",
    "#绘制聚类结果\n",
    "def draw(C):\n",
    "    colValue = ['r', 'y', 'g', 'b', 'c', 'k', 'm']\n",
    "    for i in range(len(C)):\n",
    "        coo_X = []    #x坐标列表\n",
    "        coo_Y = []    #y坐标列表\n",
    "        for j in range(len(C[i])):\n",
    "            coo_X.append(C[i][j][0])\n",
    "            coo_Y.append(C[i][j][1])\n",
    "        plt.scatter(coo_X, coo_Y, marker='x', color=colValue[i%len(colValue)], label=i)\n",
    "\n",
    "    plt.legend(loc='upper right')\n",
    "    plt.show()\n",
    "#添加代码，使用上面实现AGNES类对data数据进行层次聚类，并使用draw函数可视化聚类结果"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\textbf{7.10 实践任务}$\n",
    "\n",
    "新闻种类繁复多样，可以分为财经、体育、科技、娱乐等题材，本次实践任务要求根据提供的新闻数据集(详见文件“新闻.csv”)，综合应用中文分词、文本向量化以及本章介绍的聚类算法，构建新闻聚类分群模型。具体要求如下：\n",
    "\n",
    "(1)读取新闻数据文件，并对数据进行预处理；\n",
    "\n",
    "(2)在新闻数据集上，分别使用k-means，DBSCAN, AGNES聚类算法构建新闻聚类模型，并比sklearn库中的对应模型比较聚类的结果。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#添加代码，完成实践任务"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
